# Import libraries
import itertools
import random
import warnings

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import plotly.express as px
import pycountry

from scipy.stats import shapiro, ttest_ind, norm
import statsmodels.api as sm
from statsmodels.stats.proportion import proportions_ztest
from statsmodels.tsa.api import SimpleExpSmoothing, Holt

from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.cluster import KMeans
from sklearn import preprocessing, metrics
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

from tabulate import tabulate

warnings.filterwarnings('ignore')
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
Tableau Public Link: https://public.tableau.com/app/profile/yusen.zhou/viz/DeltaProject_AD654/InjuryCauses
Tableau workbook also available as part of the assignment submission.
The first visualization is Injury per Year per Fix Port, a histogram showing the number of accidents reported each year, with colors indicating whether the ride is portable or fixed. The injury count rose year over year from 1988 to 2001 and dropped slightly after peaking in 2001, suggesting that park management improved overall ride safety. Although the majority of accidents occurred on fixed rides, the management team cannot ignore the safety of portable rides: the small number of portable-ride cases may simply reflect how few portable rides are in operation. The second histogram depicts the age and gender distribution. Most accidents were concentrated in the under-18 age group. Boys have a higher accident rate under the age of 7, but females account for a higher share overall (57.23%). Management should alert staff to pay special attention to children when operating rides and post warning signs on the more dangerous equipment.
According to the third plot, the Device Category bar chart, the six ride types where accidents occur most often are the water slide, coaster, go-kart, spinning ride, water ride, and car & track rides. Viewed together with the fourth graph, the Accidents by Park pie chart: although water parks account for only 20.3% of total accidents, two of the top six accident-prone ride types are water attractions. The water park's management team should establish additional rules to ensure visitor safety, such as requiring children to be supervised by an adult companion on the water slide or water ride. According to the Accidents versus Park Responsibility line plot, approximately 5%-15% of all accidents are attributable to the park, including employee error, machine operation error, and mechanical malfunction. Amusement parks should conduct routine ride inspections to ensure the rides are in good working order and safe for visitors, and all park employees should be trained in safety procedures and know how to operate the rides properly.
Finally, the treemap shows the top-ranked injury types across all records. Most injuries appear to have been caused by the visitors themselves or by other visitors. For minor injuries, there should be first-aid stations throughout the park, staffed by qualified personnel, to provide immediate medical attention in case of an accident. Amusement parks should also have an emergency response plan that outlines how to respond to major emergencies and evacuate visitors when necessary.
park_accidents = pd.read_csv("park_accidents.csv")
I. Accidents by State
state_counts = park_accidents.groupby("acc_state")["acc_id"].count().sort_values(ascending=False)
print(state_counts)
acc_state
NJ    5646
CA    3407
TX    3320
FL    1070
PA     269
MI     265
OK     154
IL     152
WI     111
CO     107
OH      81
NY      69
MD      66
KY      37
NV      31
IA      14
NC      14
MA      13
WV      11
CT      11
IN       8
WA       8
MO       6
RI       5
AR       3
VA       2
TN       1
AZ       1
LA       1
GA       1
Name: acc_id, dtype: int64
II. Injury Causes
injury_causes = park_accidents.pivot_table(index="device_category", values="acc_id", aggfunc="count")
# Avoid shadowing the built-in sorted()
sorted_injury_causes = injury_causes.sort_values(by="acc_id", ascending=False)
print(sorted_injury_causes)
                    acc_id
device_category
water slide           3530
coaster               2748
spinning              1988
go-kart               1767
water ride            1163
cars & track rides    1025
aquatic play           465
other attraction       451
play equipment         403
pendulum               318
vertical drop          252
float attraction       187
wave device            178
inflatable             151
challenge activity      96
unknown                 87
alpine activity         41
trampoline              33
laser tag                1
III. Mean Age by Accident Category
mean_age_by_category = park_accidents.pivot_table(index="category", values="age_youngest", aggfunc="mean")
sorted_mean_age = mean_age_by_category.sort_values(by="age_youngest", ascending=False)
print(sorted_mean_age)
                                                    age_youngest
category
Illness or neurological symptoms                       38.382075
Illness: Seizure or LOC                                25.573171
Hyperextension or dislocation                          24.083333
Load/Unload: injured when vehicle moved                23.810000
Abrupt stop/drop/lurch                                 22.537037
Fall: patron fell off inner tube, mat or board         22.385816
Body pain (normal motion)                              22.277075
Load/Unload: scrape or stumble                         20.469062
Impact: hit wall or barrier at end of slide runout     19.470588
Restraint too tight                                    19.455882
Collision: patrons collided (participatory)            19.448649
Impact: person hit by ride                             19.298701
Impact: hit something in participatory attraction      19.051142
Injured by foreign object                              17.353774
Collision: operator-controlled vehicles                17.176056
Impact: vaginal or rectal injury                       15.800000
Collision: patron-controlled vehicles                  15.545954
Fall: patron fell from device (participatory)          15.545455
Impact: hit something within ride vehicle              15.482634
Injured in queue or exit                               15.198502
Employee injured                                       14.975610
Entrapment or pinch-point                              14.975410
Collision: patrons collided within vehicle             14.836364
Load/Unload: hit or pinched by restraint               14.054867
Impact: extremity hit something outside carrier        13.315789
Unscheduled stop                                       13.133641
Burn (includes friction burn)                          12.693182
Collision: go-kart crashed (no further descript...     12.142857
Collision: go-kart or bumper car hit stationary...     11.783410
Fall: in climb or play area                            11.777778
Other                                                  11.578947
Environmental issue                                    10.600000
Fall: patron fell from seat, but not carrier           10.384615
Fall: ejection/fall from ride                          10.281588
Seatbelt abrasion or bruising                           9.725490
Choking, water inhalation, suffocation                  8.821429
Electrical shock                                        7.766667
Derailment                                              6.977273
Equipment failure                                       4.515789
Unknown (not enough info)                               3.156627
Awkward landing                                         0.000000
IV. Top 10 Manufacturers with the Highest Number of Injured People
injured_by_manufacturer = park_accidents.groupby("manufacturer")["num_injured"].sum().sort_values(ascending=False)
top_10_manufacturers = injured_by_manufacturer.head(10)
print(top_10_manufacturers)
manufacturer
0                                  2936
In-house                           1167
Intamin AG                          795
WhiteWater West Industries Ltd.     769
Arrow Dynamics                      681
Pro Slide                           562
Zamperla                            511
Chance Manufacturing, Inc.          458
Wisdom Industries, Ltd.             415
Surf Coaster                        377
Name: num_injured, dtype: int64
Summary
Based on the exploratory data analysis, we found that New Jersey has the highest number of accidents (5646), followed by California and Texas. Park management in these states should be particularly vigilant about safety measures and accident prevention.
Among device categories, water slides account for the most accidents (3530), followed by coasters and spinning rides. Park management should focus on improving the safety of these attractions (for example, by tightening standard operating procedures) and implement additional precautions.
In terms of mean age by accident category, accidents involving "Illness or neurological symptoms" and "Illness: Seizure or LOC" predominantly affect older individuals, with mean ages of 38.38 and 25.57 years, respectively. Categories such as "Derailment," "Electrical shock," and "Choking, water inhalation, suffocation" involve younger individuals, with mean ages of 6.98, 7.77, and 8.82 years, respectively. Park management can therefore prioritize safety measures, staff training, and targeted communication to prevent specific types of accidents in specific age groups.
Lastly, the top 10 manufacturers with the highest number of injured people include both well-known companies and in-house manufacturers. Park management should meticulously examine the safety and maintenance records of attractions built by these manufacturers, assess whether additional safety measures are needed, or even consider replacing rides from these manufacturers with ones from companies that have lower accident rates. Notably, while compiling the top-10 summary, we discovered that the largest share of accidents is associated with manufacturer "0," which signifies missing data. Park management should take this issue seriously and investigate the source of these accidents, as they are connected to faulty rides.
1. Importing and Exploring Dataset
# Importing ski hotel dataset
hotel = pd.read_csv("ski_hotels.csv")
# Dataset Info
hotel.info()
# Dataset Head
hotel.head()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 407 entries, 0 to 406
Data columns (total 24 columns):
 #   Column                  Non-Null Count  Dtype
---  ------                  --------------  -----
 0   Unnamed: 0              407 non-null    int64
 1   country                 407 non-null    object
 2   resort                  407 non-null    object
 3   hotel                   407 non-null    object
 4   price (£)               407 non-null    int64
 5   distance_from_lift_(m)  407 non-null    object
 6   altitude (m)            407 non-null    int64
 7   totalPiste (km)         407 non-null    int64
 8   totalLifts              407 non-null    int64
 9   gondolas                407 non-null    int64
 10  chairlifts              407 non-null    int64
 11  draglifts               407 non-null    int64
 12  blues                   407 non-null    float64
 13  reds                    407 non-null    float64
 14  blacks                  407 non-null    float64
 15  totalRuns               407 non-null    float64
 16  link                    407 non-null    object
 17  sleeps                  407 non-null    object
 18  decSnowLow2020(cm)      407 non-null    object
 19  decSnowHigh2020(cm)     407 non-null    object
 20  janSnowLow2020(cm)      407 non-null    object
 21  janSnowHigh2020(cm)     407 non-null    object
 22  febSnowLow2020(cm)      407 non-null    object
 23  febSnowHigh2020(cm)     407 non-null    object
dtypes: float64(4), int64(8), object(12)
memory usage: 76.4+ KB
| Unnamed: 0 | country | resort | hotel | price (£) | distance_from_lift_(m) | altitude (m) | totalPiste (km) | totalLifts | gondolas | chairlifts | draglifts | blues | reds | blacks | totalRuns | link | sleeps | decSnowLow2020(cm) | decSnowHigh2020(cm) | janSnowLow2020(cm) | janSnowHigh2020(cm) | febSnowLow2020(cm) | febSnowHigh2020(cm) | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | italy | bardonecchia | residence-tabor | 550 | unknown | 1312 | 140 | 23 | 0 | 9 | 14 | 17.0 | 24.0 | 1.0 | 42.0 | https://www.igluski.com/ski-resorts/italy/bard... | unknown | 70 | 170 | 60 | 200 | 47 | 177 |
| 1 | 1 | italy | bardonecchia | residence-villa-frejus | 561 | unknown | 1312 | 140 | 23 | 0 | 9 | 14 | 17.0 | 24.0 | 1.0 | 42.0 | https://www.igluski.com/ski-resorts/italy/bard... | unknown | 70 | 170 | 60 | 200 | 47 | 177 |
| 2 | 2 | bulgaria | bansko | hotel-mura | 566 | 1100 | 935 | 70 | 24 | 1 | 7 | 16 | 5.0 | 8.0 | 1.0 | 14.0 | https://www.igluski.com/ski-resorts/bulgaria/b... | 140 | 0 | 70 | 5 | 75 | 5 | 77 |
| 3 | 3 | bulgaria | borovets | hotel-samokov | 574 | 75 | 1390 | 58 | 18 | 1 | 4 | 13 | 1.0 | 13.0 | 1.0 | 15.0 | https://www.igluski.com/ski-resorts/bulgaria/b... | 590 | 0 | 0 | 160 | 200 | 194 | 246 |
| 4 | 4 | bulgaria | bansko | hotel-lion---bansko | 596 | 800 | 935 | 70 | 24 | 1 | 7 | 16 | 5.0 | 8.0 | 1.0 | 14.0 | https://www.igluski.com/ski-resorts/bulgaria/b... | 386 | 0 | 70 | 5 | 75 | 5 | 77 |
2. Selecting Variables for the Clustering Analysis (the 'country' column is kept temporarily for imputation)
hotel_cluster = hotel[['country','price (£)', 'distance_from_lift_(m)', 'altitude (m)','totalPiste (km)',
'totalLifts', 'gondolas', 'chairlifts', 'draglifts', 'blues', 'reds',
'blacks', 'totalRuns', 'sleeps', 'decSnowLow2020(cm)','decSnowHigh2020(cm)',
'janSnowLow2020(cm)', 'janSnowHigh2020(cm)','febSnowLow2020(cm)',
'febSnowHigh2020(cm)']]
3. Checking the Dataset for Missing Values and Irrelevant ("unknown") Values
# Checking for missing value
hotel_cluster.isna().sum()
country                   0
price (£)                 0
distance_from_lift_(m)    0
altitude (m)              0
totalPiste (km)           0
totalLifts                0
gondolas                  0
chairlifts                0
draglifts                 0
blues                     0
reds                      0
blacks                    0
totalRuns                 0
sleeps                    0
decSnowLow2020(cm)        0
decSnowHigh2020(cm)       0
janSnowLow2020(cm)        0
janSnowHigh2020(cm)       0
febSnowLow2020(cm)        0
febSnowHigh2020(cm)       0
dtype: int64
No "missing" value detected, but looking at the head and unique function output, there are couples of "unknown" value in 'distance_fromlift(m)' and 'sleeps' variables.
# Checking for "unknown" value for each variables
unknown_counts = []
for col in hotel_cluster.columns:
if 'unknown' in hotel_cluster[col].unique():
unknown_count = hotel_cluster[col].value_counts()['unknown']
unknown_percent = (unknown_count / hotel_cluster.shape[0]) * 100
else:
unknown_percent = 0
unknown_counts.append((col, unknown_percent))
table = tabulate(unknown_counts, headers=['Column', 'Unknown %'], tablefmt='grid')
print(table)
+------------------------+-------------+
| Column                 |   Unknown % |
+========================+=============+
| country                |      0      |
+------------------------+-------------+
| price (£)              |      0      |
+------------------------+-------------+
| distance_from_lift_(m) |     47.1744 |
+------------------------+-------------+
| altitude (m)           |      0      |
+------------------------+-------------+
| totalPiste (km)        |      0      |
+------------------------+-------------+
| totalLifts             |      0      |
+------------------------+-------------+
| gondolas               |      0      |
+------------------------+-------------+
| chairlifts             |      0      |
+------------------------+-------------+
| draglifts              |      0      |
+------------------------+-------------+
| blues                  |      0      |
+------------------------+-------------+
| reds                   |      0      |
+------------------------+-------------+
| blacks                 |      0      |
+------------------------+-------------+
| totalRuns              |      0      |
+------------------------+-------------+
| sleeps                 |     23.5872 |
+------------------------+-------------+
| decSnowLow2020(cm)     |     13.7592 |
+------------------------+-------------+
| decSnowHigh2020(cm)    |     13.7592 |
+------------------------+-------------+
| janSnowLow2020(cm)     |     13.7592 |
+------------------------+-------------+
| janSnowHigh2020(cm)    |     13.7592 |
+------------------------+-------------+
| febSnowLow2020(cm)     |     13.7592 |
+------------------------+-------------+
| febSnowHigh2020(cm)    |     13.7592 |
+------------------------+-------------+
Variables with "unknown" values: distance_from_lift_(m) (47.2%), sleeps (23.6%), and the six 2020 snow-depth variables (13.8% each).
4. Checking the magnitude of "unknown" value to the dataset
a = hotel[['country', 'resort', 'hotel', 'distance_from_lift_(m)', 'sleeps']]
b = a[a['distance_from_lift_(m)'] == 'unknown']
# Number of hotels per country with an "unknown" distance-from-lift value
print(b['country'].value_counts())
print("")
# Total number of hotels per country
print(hotel['country'].value_counts())
print("")
# Share of hotels per country with an unknown distance-from-lift value
print(b['country'].value_counts()/hotel['country'].value_counts())
austria     85
italy       59
france      25
finland     14
andorra      8
bulgaria     1
Name: country, dtype: int64

austria     169
italy       105
france       77
andorra      29
finland      16
bulgaria     11
Name: country, dtype: int64

andorra     0.275862
austria     0.502959
bulgaria    0.090909
finland     0.875000
france      0.324675
italy       0.561905
Name: country, dtype: float64
For clustering, it is better to exclude the 'distance_from_lift_(m)' and 'sleeps' variables. These two variables are specific to each resort and country, making them unsuitable for imputation. The snow variables, on the other hand, can be imputed because snowfall is a naturally occurring phenomenon that can be estimated from the available data.
Dropping the rows with "unknown" values is not recommended, because they constitute a significant portion of the data for each country.
5. Dropping and Imputing Variables
# Dropping Variables
hotel_cluster.drop(columns=['distance_from_lift_(m)', 'sleeps'], inplace=True)
# Loop over the snow columns, turning "unknown" into NaN and casting to numeric
cols = ['decSnowLow2020(cm)', 'decSnowHigh2020(cm)', 'janSnowLow2020(cm)',
        'janSnowHigh2020(cm)', 'febSnowLow2020(cm)', 'febSnowHigh2020(cm)']
snow = pd.DataFrame(columns=cols)
for col in cols:
    # errors='coerce' converts non-numeric entries such as "unknown" to NaN
    snow[col] = pd.to_numeric(hotel_cluster[col], errors='coerce')
snow['country'] = hotel['country']
# Country-level means of each snow variable, to be used as imputation values
print(snow.groupby('country').mean())
decSnowLow2020(cm) decSnowHigh2020(cm) janSnowLow2020(cm) \
country
andorra 20.833333 47.777778 99.444444
austria 16.944444 73.783951 51.524691
bulgaria 0.000000 38.181818 57.272727
finland 35.818182 50.454545 71.363636
france 57.807018 117.526316 131.491228
italy 67.500000 187.391304 50.347826
janSnowHigh2020(cm) febSnowLow2020(cm) febSnowHigh2020(cm)
country
andorra 170.833333 78.444444 151.666667
austria 132.265432 51.679012 147.179012
bulgaria 110.909091 85.818182 140.000000
finland 77.727273 85.909091 100.454545
france 240.070175 126.771930 235.368421
italy 201.206522 58.630435 202.945652
# Converting snow information in the hotel_cluster dataset into numeric
for col in cols:
hotel_cluster[col] = pd.to_numeric(hotel_cluster[col], errors='coerce')
# Imputing the country-level mean snow depth for each missing value
snow_mean = snow.groupby('country').mean()
for col in cols:
    for idx, row in hotel_cluster.iterrows():
        # After pd.to_numeric(..., errors='coerce'), former "unknown" entries are NaN
        if row['country'] in snow_mean.index and pd.isna(row[col]):
            hotel_cluster.loc[idx, col] = snow_mean.loc[row['country'], col]
# Removing the intermediary 'country' variable
hotel_cluster.drop(columns=['country'], inplace=True)
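The row-by-row iterrows loop above is correct but slow on larger data. A vectorized sketch of the same imputation (assuming the snow_mean table, the cols list, and the original hotel DataFrame from above):
# Vectorized equivalent: map each row's country to its mean snow depth, then fill NaNs
for col in cols:
    hotel_cluster[col] = hotel_cluster[col].fillna(hotel['country'].map(snow_mean[col]))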
6. Scaling the Data
# Scaling the hotel_cluster dataframe by standardization and assigning it to new variable named scaled_hotel_cluster
scaler = preprocessing.StandardScaler()
scaled_hotel_cluster = scaler.fit_transform(hotel_cluster)
scaled_hotel_cluster = pd.DataFrame(scaled_hotel_cluster)
# Changing the colnames to its original value
scaled_hotel_cluster.columns = hotel_cluster.columns
7. Constructing Elbow Chart
# Initialize an empty dict to store the SSE for each k
sse = {}
for k in range(1, 11):
# Initialize KMeans with k clusters
kmeans = KMeans(n_clusters=k, random_state=296)
# Fit KMeans on the normalized dataset
kmeans.fit(scaled_hotel_cluster)
sse[k] = kmeans.inertia_
# Add the plot title "The Elbow Method"
plt.title('The Elbow Method')
# Add X-axis label "k"
plt.xlabel('k')
# Add Y-axis label "SSE"
plt.ylabel('SSE')
sns.pointplot(x=list(sse.keys()), y=list(sse.values()));
Looking at the elbow chart above, the SSE drops sharply from k=1 to k=2, while each further increase in k contributes only a small decrease in SSE. We inspect k=3 through k=5 in detail to check which number of clusters is easiest for the conference audience to understand.
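Silhouette scores offer a complementary check on the elbow chart before settling on k; a minimal sketch, assuming the scaled_hotel_cluster DataFrame from above:
from sklearn.metrics import silhouette_score

# Scores closer to 1 indicate better-separated clusters
for k in range(2, 6):
    km = KMeans(n_clusters=k, random_state=296).fit(scaled_hotel_cluster)
    print(k, round(silhouette_score(scaled_hotel_cluster, km.labels_), 3))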
8. Finding the Number of k
# Number of cluster equals to 3
kmeans3 = KMeans(n_clusters=3, random_state=296)
kmeans3.fit(scaled_hotel_cluster)
cluster_labels3 = kmeans3.labels_
k_means3 = scaled_hotel_cluster.assign(Cluster = cluster_labels3)
k_means3.groupby(['Cluster']).agg({
'price (£)': ['mean','std'],
'altitude (m)':['mean','std'],
'totalPiste (km)': ['mean','std'],
'totalLifts': ['mean','std'],
'gondolas': ['mean','std'],
'chairlifts': ['mean','std'],
'draglifts': ['mean','std'],
'blues': ['mean','std'],
'reds': ['mean','std'],
'blacks': ['mean','std'],
'totalRuns': ['mean','std'],
'decSnowLow2020(cm)': ['mean','std'],
'decSnowHigh2020(cm)': ['mean','std'],
'janSnowLow2020(cm)': ['mean','std'],
'janSnowHigh2020(cm)': ['mean','std'],
'febSnowLow2020(cm)': ['mean','std'],
'febSnowHigh2020(cm)': ['mean','std','count'],
}).round(2)
| price (£) | altitude (m) | totalPiste (km) | totalLifts | gondolas | chairlifts | draglifts | blues | reds | blacks | totalRuns | decSnowLow2020(cm) | decSnowHigh2020(cm) | janSnowLow2020(cm) | janSnowHigh2020(cm) | febSnowLow2020(cm) | febSnowHigh2020(cm) | |||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| mean | std | mean | std | mean | std | mean | std | mean | std | mean | std | mean | std | mean | std | mean | std | mean | std | mean | std | mean | std | mean | std | mean | std | mean | std | mean | std | mean | std | count | |
| Cluster | |||||||||||||||||||||||||||||||||||
| 0 | -0.16 | 0.98 | -0.24 | 0.97 | -0.39 | 0.72 | -0.42 | 0.60 | -0.33 | 0.52 | -0.40 | 0.67 | -0.37 | 0.64 | -0.42 | 0.40 | -0.46 | 0.47 | -0.50 | 0.52 | -0.50 | 0.43 | -0.26 | 0.78 | -0.01 | 1.08 | -0.34 | 0.61 | -0.27 | 0.93 | -0.36 | 0.64 | -0.27 | 0.98 | 287 |
| 1 | 0.92 | 1.24 | 0.93 | 1.26 | 2.31 | 0.00 | 3.23 | 0.16 | 3.76 | 0.15 | 2.60 | 0.03 | 2.68 | 0.29 | 2.87 | 1.28 | 2.01 | 0.97 | 1.53 | 0.16 | 2.61 | 0.20 | 1.53 | 0.00 | 0.17 | 0.00 | 2.92 | 0.00 | 1.31 | 0.00 | 2.80 | 0.00 | 1.31 | 0.00 | 14 |
| 2 | 0.31 | 0.88 | 0.52 | 0.75 | 0.75 | 0.88 | 0.71 | 0.73 | 0.41 | 0.87 | 0.73 | 0.89 | 0.64 | 1.02 | 0.75 | 1.04 | 0.98 | 1.03 | 1.16 | 0.87 | 1.01 | 0.85 | 0.51 | 1.21 | 0.02 | 0.82 | 0.52 | 1.07 | 0.55 | 0.88 | 0.62 | 0.98 | 0.55 | 0.72 | 106 |
Looking at the result of the first clustering trial with k=3, the data splits cleanly across the variables, suggesting three distinct clusters with clear differences between them.
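To describe these clusters in original units rather than z-scores (which will help when naming them later), the standardized cluster centers can be mapped back through the scaler; a minimal sketch, assuming scaler, kmeans3, and hotel_cluster from above:
# Cluster centers expressed in the original units (£, m, km, counts, cm)
centers_original = pd.DataFrame(
    scaler.inverse_transform(kmeans3.cluster_centers_),
    columns=hotel_cluster.columns)
print(centers_original.round(1))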
# Number of cluster equals to 4
kmeans4 = KMeans(n_clusters=4, random_state=296)
kmeans4.fit(scaled_hotel_cluster)
cluster_labels4 = kmeans4.labels_
k_means4 = scaled_hotel_cluster.assign(Cluster = cluster_labels4)
k_means4.groupby(['Cluster']).agg({
'price (£)': ['mean','std'],
'altitude (m)':['mean','std'],
'totalPiste (km)': ['mean','std'],
'totalLifts': ['mean','std'],
'gondolas': ['mean','std'],
'chairlifts': ['mean','std'],
'draglifts': ['mean','std'],
'blues': ['mean','std'],
'reds': ['mean','std'],
'blacks': ['mean','std'],
'totalRuns': ['mean','std'],
'decSnowLow2020(cm)': ['mean','std'],
'decSnowHigh2020(cm)': ['mean','std'],
'janSnowLow2020(cm)': ['mean','std'],
'janSnowHigh2020(cm)': ['mean','std'],
'febSnowLow2020(cm)': ['mean','std'],
'febSnowHigh2020(cm)': ['mean','std','count'],
}).round(2)
| price (£) | altitude (m) | totalPiste (km) | totalLifts | gondolas | chairlifts | draglifts | blues | reds | blacks | totalRuns | decSnowLow2020(cm) | decSnowHigh2020(cm) | janSnowLow2020(cm) | janSnowHigh2020(cm) | febSnowLow2020(cm) | febSnowHigh2020(cm) | |||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| mean | std | mean | std | mean | std | mean | std | mean | std | mean | std | mean | std | mean | std | mean | std | mean | std | mean | std | mean | std | mean | std | mean | std | mean | std | mean | std | mean | std | count | |
| Cluster | |||||||||||||||||||||||||||||||||||
| 0 | -0.15 | 0.97 | -0.25 | 0.99 | -0.34 | 0.75 | -0.38 | 0.60 | -0.27 | 0.60 | -0.38 | 0.68 | -0.34 | 0.64 | -0.39 | 0.42 | -0.40 | 0.52 | -0.44 | 0.58 | -0.45 | 0.47 | -0.26 | 0.79 | -0.21 | 0.75 | -0.29 | 0.60 | -0.45 | 0.56 | -0.33 | 0.63 | -0.45 | 0.57 | 280 |
| 1 | 0.22 | 0.89 | 0.89 | 0.44 | 0.64 | 0.83 | 0.73 | 0.48 | -0.07 | 0.58 | 0.75 | 0.70 | 0.87 | 0.60 | 0.29 | 0.61 | 0.42 | 0.60 | 0.98 | 0.64 | 0.49 | 0.25 | 1.14 | 1.14 | 0.45 | 0.75 | 0.87 | 1.13 | 1.14 | 0.56 | 1.14 | 0.77 | 1.02 | 0.42 | 64 |
| 2 | 0.66 | 0.98 | 0.01 | 0.96 | 1.41 | 1.06 | 1.60 | 1.40 | 1.93 | 1.38 | 1.44 | 1.25 | 1.18 | 1.59 | 2.20 | 1.11 | 2.15 | 0.83 | 1.60 | 0.97 | 2.35 | 0.30 | 0.14 | 0.99 | -0.49 | 0.49 | 0.92 | 1.49 | 0.16 | 0.82 | 0.76 | 1.51 | 0.20 | 0.78 | 46 |
| 3 | -0.17 | 1.15 | 0.66 | 0.52 | -0.63 | 0.14 | -0.75 | 0.04 | -0.52 | 0.29 | -0.40 | 0.09 | -0.87 | 0.03 | -0.66 | 0.11 | -0.82 | 0.04 | -0.71 | 0.16 | -0.82 | 0.06 | -0.44 | 0.32 | 3.05 | 0.84 | -0.93 | 0.15 | 2.67 | 0.63 | -0.98 | 0.15 | 2.98 | 0.27 | 17 |
Looking at the result of the second clustering trial with k=4, the distinction between clusters is less obvious than with k=3: clusters 0 and 3 share quite similar characteristics on many variables and would need a cleaner split to be clearly distinguished.
# Number of cluster equals to 5
kmeans5 = KMeans(n_clusters=5, random_state=296)
kmeans5.fit(scaled_hotel_cluster)
cluster_labels5 = kmeans5.labels_
k_means5 = scaled_hotel_cluster.assign(Cluster = cluster_labels5)
k_means5.groupby(['Cluster']).agg({
'price (£)': ['mean','std'],
'altitude (m)':['mean','std'],
'totalPiste (km)': ['mean','std'],
'totalLifts': ['mean','std'],
'gondolas': ['mean','std'],
'chairlifts': ['mean','std'],
'draglifts': ['mean','std'],
'blues': ['mean','std'],
'reds': ['mean','std'],
'blacks': ['mean','std'],
'totalRuns': ['mean','std'],
'decSnowLow2020(cm)': ['mean','std'],
'decSnowHigh2020(cm)': ['mean','std'],
'janSnowLow2020(cm)': ['mean','std'],
'janSnowHigh2020(cm)': ['mean','std'],
'febSnowLow2020(cm)': ['mean','std'],
'febSnowHigh2020(cm)': ['mean','std','count'],
}).round(2)
| price (£) | altitude (m) | totalPiste (km) | totalLifts | gondolas | chairlifts | draglifts | blues | reds | blacks | totalRuns | decSnowLow2020(cm) | decSnowHigh2020(cm) | janSnowLow2020(cm) | janSnowHigh2020(cm) | febSnowLow2020(cm) | febSnowHigh2020(cm) | |||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| mean | std | mean | std | mean | std | mean | std | mean | std | mean | std | mean | std | mean | std | mean | std | mean | std | mean | std | mean | std | mean | std | mean | std | mean | std | mean | std | mean | std | count | |
| Cluster | |||||||||||||||||||||||||||||||||||
| 0 | 0.32 | 0.88 | 0.69 | 0.57 | 0.65 | 0.78 | 0.76 | 0.49 | 0.08 | 0.60 | 0.89 | 0.73 | 0.76 | 0.71 | 0.58 | 0.88 | 0.79 | 0.98 | 1.31 | 0.94 | 0.86 | 0.79 | 0.90 | 1.14 | 0.21 | 0.82 | 0.84 | 1.02 | 0.85 | 0.79 | 1.02 | 0.75 | 0.80 | 0.59 | 80 |
| 1 | -0.05 | 0.98 | -0.31 | 0.92 | 0.09 | 0.85 | 0.04 | 0.37 | 0.17 | 0.68 | 0.01 | 0.49 | -0.01 | 0.56 | -0.01 | 0.70 | 0.04 | 0.78 | -0.12 | 0.59 | -0.01 | 0.68 | -0.56 | 0.56 | -0.35 | 0.63 | -0.41 | 0.46 | -0.44 | 0.49 | -0.51 | 0.37 | -0.43 | 0.52 | 179 |
| 2 | -0.17 | 1.15 | 0.66 | 0.52 | -0.63 | 0.14 | -0.75 | 0.04 | -0.52 | 0.29 | -0.40 | 0.09 | -0.87 | 0.03 | -0.66 | 0.11 | -0.82 | 0.04 | -0.71 | 0.16 | -0.82 | 0.06 | -0.44 | 0.32 | 3.05 | 0.84 | -0.93 | 0.15 | 2.67 | 0.63 | -0.98 | 0.15 | 2.98 | 0.27 | 17 |
| 3 | -0.24 | 0.97 | -0.20 | 1.06 | -0.86 | 0.27 | -0.98 | 0.29 | -0.73 | 0.47 | -0.99 | 0.39 | -0.84 | 0.32 | -0.71 | 0.21 | -0.80 | 0.22 | -0.84 | 0.26 | -0.86 | 0.22 | 0.14 | 0.88 | -0.06 | 0.87 | -0.14 | 0.74 | -0.48 | 0.63 | -0.08 | 0.83 | -0.49 | 0.62 | 114 |
| 4 | 0.72 | 1.23 | 0.67 | 1.28 | 2.36 | 0.12 | 3.32 | 0.25 | 3.20 | 1.26 | 2.72 | 0.28 | 2.99 | 0.75 | 2.85 | 1.16 | 2.03 | 0.88 | 1.47 | 0.21 | 2.60 | 0.18 | 1.13 | 0.88 | 0.01 | 0.35 | 2.29 | 1.39 | 1.13 | 0.41 | 2.12 | 1.50 | 1.03 | 0.63 | 17 |
Looking at the result of clustering with k=5, the data splits nicely on some variables but is less distinctive on others, suggesting that some clusters are quite similar on one variable yet very different on others.
For the Lobsterland Conference, we recommend presenting the three-cluster result: its simplicity and easy-to-understand characteristics make it more intuitive for a broad audience to grasp.
The five-cluster result could later be presented in a deeper-dive session, where interested participants can study the additional clusters in order to maximize their profits.
9. Naming the Clusters
Cluster 0: The Alpine Haven
This cluster evokes a cozy, comfortable stay amid the rugged beauty of the mountains. Compared with the other two clusters, it has an average price, the highest altitude, average total piste, average total lifts, average numbers of gondolas, chairlifts, and draglifts, average numbers of blue, red, and black runs, average total runs, the deepest snow in December, and relatively deep snow in January and February.
It is nicknamed "Alpine" for its high altitude near the mountain tops, and "Haven" for its affordability: it still offers guests a decent number and variety of lifts and plenty of piste, as well as superb snow depth throughout the winter season, making it a haven for ski lovers.
This cluster might be well suited to skiers with a medium-to-high budget who are willing to spend a bit more for a decent skiing experience.
Cluster 1: The Slope Saver
This cluster is defined by affordability and might be a good choice for skiers on a budget. It has the lowest price, lowest altitude, lowest total piste, lowest total lifts, the fewest gondolas, chairlifts, and draglifts, the fewest blue, red, and black runs, the lowest total runs, and the thinnest snow in all the winter months.
Resorts in this cluster might offer a different kind of skiing experience, focused on the natural beauty of the surrounding valleys rather than high mountain peaks, making the most of thinner snow conditions by seeking out the best spots on the mountain.
This resort cluster is best suited to skiers who want to maximize their skiing months on a limited budget, or who want to stretch their budget by "ski-hopping" between resorts.
Cluster 2: The Glacier Elite
This cluster is about reaching the highest levels of skiing and might be the best choice for serious skiers looking to push themselves to the limit; its high-end amenities and air of exclusivity also make it an ideal choice for skiers seeking a more upscale experience.
It has the highest price, average altitude, the highest total piste and total lifts, the most gondolas, chairlifts, and draglifts, the most blue, red, and black runs, the highest total runs, and average snow depth during the winter months.
This resort cluster might be best suited to skiers with a high-end budget and a high level of skiing proficiency who are looking for a serious challenge.
10. Preparing Clustered Dataset for Visualization
# Assigning back the cluster label to hotel_cluster dataset columns
cluster = hotel_cluster.assign(Cluster = cluster_labels3)
cluster['Cluster'] = cluster['Cluster'].astype('category')
# Renaming the cluster
cluster['Cluster'] = cluster['Cluster'].cat.rename_categories({
0:'The Alpine Haven',
1:'The Slope Saver',
2:'The Glacier Elite',
})
# Adding back the dropped variable from the original hotel dataset into cluster dataset
cluster['country'] = hotel['country']
cluster['resort'] = hotel['resort']
cluster['hotel'] = hotel['hotel']
cluster['resort_link'] = hotel['link']
# reposition the columns
cluster = cluster.reindex(columns=['country', 'resort', 'hotel', 'Cluster', 'price (£)',
'altitude (m)', 'totalPiste (km)', 'totalLifts',
'gondolas', 'chairlifts', 'draglifts', 'blues', 'reds',
'blacks','totalRuns', 'decSnowLow2020(cm)',
'decSnowHigh2020(cm)','janSnowLow2020(cm)', 'janSnowHigh2020(cm)',
'febSnowLow2020(cm)','febSnowHigh2020(cm)'])
# Averaging the low/high snow depths into one value per month
cluster['snow_dec'] = (cluster['decSnowLow2020(cm)'] + cluster['decSnowHigh2020(cm)'])/2
cluster['snow_jan'] = (cluster['janSnowLow2020(cm)'] + cluster['janSnowHigh2020(cm)'])/2
cluster['snow_feb'] = (cluster['febSnowLow2020(cm)'] + cluster['febSnowHigh2020(cm)'])/2
# dropping duplicate variables
cluster.drop(columns=['decSnowLow2020(cm)','decSnowHigh2020(cm)','janSnowLow2020(cm)',
'janSnowHigh2020(cm)', 'febSnowLow2020(cm)','febSnowHigh2020(cm)'], inplace=True)
11. Visualizing Clusters: Total Piste vs Altitude
# Compute mean total piste length and mean altitude for each cluster
cluster_means = cluster.groupby('Cluster')[['totalPiste (km)', 'altitude (m)']].mean()
# Plotting scatterplot between Total Piste vs Altitude
plt.figure(figsize=(7.5,4))
plt.xlabel("Total Piste Length (km)")
plt.ylabel("Altitude (m)")
plt.title("Resort Total Piste vs Altitude")
sns.despine()
pl = sns.scatterplot(
x='totalPiste (km)',
y='altitude (m)',
s=100,
hue='Cluster',
data=cluster)
# Define centroid colors
cluster_colors = ['steelblue','darkorange','green']
# Add centroid (mean) points and labels for each cluster
for i in range(len(cluster_means)):
plt.plot(cluster_means.iloc[i]['totalPiste (km)'], cluster_means.iloc[i]['altitude (m)'],
marker='s', markersize=15, color=cluster_colors[i])
12. Visualizing Clusters: Snow Thickness Trend
sns.catplot(y='Cluster', x='value', col='variable', kind='box',
data=pd.melt(cluster[['Cluster', 'snow_dec', 'snow_jan', 'snow_feb']], id_vars=['Cluster']));
13. Visualizing Cluster: Number of Runs
# Reshape data using melt
melted_df = cluster[['blues', 'reds', 'blacks', 'Cluster']].melt(id_vars='Cluster', var_name='Variable', value_name='Value')
# Convert Cluster column to categorical
melted_df['Cluster'] = pd.Categorical(melted_df['Cluster'])
# Plot bar chart (errorbar=None replaces the deprecated ci=None in newer seaborn)
sns.barplot(x='Variable', y='Value', hue='Cluster', data=melted_df, errorbar=None)
plt.title("Number of Runs in Each Resort Cluster");
14. Visualizing Cluster: Prices
plt.figure(figsize=(10, 6))
# errorbar=None replaces the deprecated ci=None in newer seaborn
sns.barplot(x='Cluster', y='price (£)', hue='Cluster', data=cluster, dodge=False, errorbar=None)
plt.xlabel("Cluster")
plt.ylabel("Price per Night (£)")
plt.title("Cluster vs Price per Night (£)");
15. Cluster Visualization: Cluster GIS The Alpine Haven
# Convert all country names to title case
cluster['country'] = cluster['country'].str.title()
# Create a dictionary of country names and their ISO Alpha-3 codes
iso_codes = {country.name: country.alpha_3 for country in pycountry.countries}
# Replace the country names with their ISO Alpha-3 codes in the 'country' Series
cluster['iso_codes'] = cluster['country'].replace(iso_codes)
# Filter the rows that correspond to 'The Alpine Haven' cluster
alpine_haven = cluster[cluster['Cluster'] == 'The Alpine Haven']
# Group the rows by country and count the size of each group
country_counts_ah = alpine_haven.groupby('country').size()
# Reset the index to turn the resulting Series into a DataFrame
country_counts_ah = country_counts_ah.reset_index()
# Map the country names to their ISO Alpha-3 codes
country_counts_ah['iso_codes'] = country_counts_ah['country'].map(iso_codes)
# Rename the columns to 'country', 'count', and 'iso_codes'
country_counts_ah.columns = ['country', 'count', 'iso_codes']
# Create a choropleth map using plotly
fig = px.choropleth(country_counts_ah,
locations="iso_codes",
color="count",
hover_name="country",
projection="mercator",
title="The Alpine Haven: Country Count"
)
fig.show()
16. Cluster Visualization: Cluster GIS The Slope Saver
# Filter the rows that correspond to 'The Slope Saver' cluster
slope_saver = cluster[cluster['Cluster'] == 'The Slope Saver']
# Group the rows by country and count the size of each group
country_counts_ss = slope_saver.groupby('country').size()
# Reset the index to turn the resulting Series into a DataFrame
country_counts_ss = country_counts_ss.reset_index()
# Map the country names to their ISO Alpha-3 codes
country_counts_ss['iso_codes'] = country_counts_ss['country'].map(iso_codes)
# Rename the columns to 'country', 'count', and 'iso_codes'
country_counts_ss.columns = ['country', 'count', 'iso_codes']
# Create a choropleth map using plotly
fig = px.choropleth(country_counts_ss,
locations="iso_codes",
color="count",
hover_name="country",
projection="mercator",
title="The Slope Saver: Country Count"
)
fig.show()
17. Cluster Visualization: Cluster GIS The Glacier Elite
# Filter the rows that correspond to 'The Glacier Elite' cluster
glacier_elite = cluster[cluster['Cluster'] == 'The Glacier Elite']
# Group the rows by country and count the size of each group
country_counts_ge = glacier_elite.groupby('country').size()
# Reset the index to turn the resulting Series into a DataFrame
country_counts_ge = country_counts_ge.reset_index()
# Map the country names to their ISO Alpha-3 codes
country_counts_ge['iso_codes'] = country_counts_ge['country'].map(iso_codes)
# Rename the columns to 'country', 'count', and 'iso_codes'
country_counts_ge.columns = ['country', 'count', 'iso_codes']
# Create a choropleth map using plotly
fig = px.choropleth(country_counts_ge,
locations="iso_codes",
color="count",
hover_name="country",
projection="mercator",
title="The Glacier Elite: Country Count"
)
fig.show()
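The three choropleth cells above repeat the same steps with only the cluster name changing; as a hedged refactoring sketch (assuming the cluster DataFrame and iso_codes dict defined earlier), they could be collapsed into a single helper:
def plot_cluster_map(cluster_name):
    """Choropleth of hotel counts per country for one cluster."""
    subset = cluster[cluster['Cluster'] == cluster_name]
    counts = subset.groupby('country').size().reset_index(name='count')
    counts['iso_codes'] = counts['country'].map(iso_codes)
    fig = px.choropleth(counts,
                        locations="iso_codes",
                        color="count",
                        hover_name="country",
                        projection="mercator",
                        title=f"{cluster_name}: Country Count")
    fig.show()

for name in ['The Alpine Haven', 'The Slope Saver', 'The Glacier Elite']:
    plot_cluster_map(name)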
18. Cluster Visualization: Total Piste vs Total Runs
# Plotting scatterplot between Total Piste vs Total Runs
plt.figure(figsize=(7.5,4))
plt.xlabel("Total Piste Length (km)")
plt.ylabel("Total Runs")
plt.title("Resort Total Piste vs Total Runs")
sns.despine()
pl2 = sns.scatterplot(
x='totalPiste (km)',
y='totalRuns',
s=100,
hue='Cluster',
data=cluster)
1. Importing Datasets
costs = pd.read_csv("amenity_costs.csv")
amenity = pd.read_csv("hotel_amenities.csv")
2. Checking for NA Values
amenity.isna().any()
WiFi_Network    False
breakfast       False
parking         False
gym             False
flex_check      False
shuttle_bus     False
air_pure        False
jacuzzi         False
VIP_shop        False
pool_temp       False
avg_rating      False
dtype: bool
3. Checking for Unique Values in Each Variable
for i in amenity.columns[0:10]:
print(i,amenity[i].unique())
WiFi_Network ['Basic' 'Strong' 'Best in Class']
breakfast ['None' 'Continental' 'Full Buffet']
parking ['Valet' 'Open Lot']
gym ['None' 'Basic' 'Advanced' 'Super']
flex_check ['No' 'Yes']
shuttle_bus ['No' 'Yes']
air_pure ['No' 'Yes']
jacuzzi ['No' 'Yes']
VIP_shop ['No' 'Yes']
pool_temp [76 80 84]
4. Dummifying Variables
amenity2 = pd.get_dummies(amenity, columns = ['WiFi_Network', 'breakfast', 'parking', 'gym',
'flex_check', 'shuttle_bus', 'air_pure', 'jacuzzi',
'VIP_shop', 'pool_temp'])
amenity2.head()
| avg_rating | WiFi_Network_Basic | WiFi_Network_Best in Class | WiFi_Network_Strong | breakfast_Continental | breakfast_Full Buffet | breakfast_None | parking_Open Lot | parking_Valet | gym_Advanced | gym_Basic | gym_None | gym_Super | flex_check_No | flex_check_Yes | shuttle_bus_No | shuttle_bus_Yes | air_pure_No | air_pure_Yes | jacuzzi_No | jacuzzi_Yes | VIP_shop_No | VIP_shop_Yes | pool_temp_76 | pool_temp_80 | pool_temp_84 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4.57 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 |
| 1 | 7.60 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 |
| 2 | 5.66 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 |
| 3 | 2.80 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 |
| 4 | 4.56 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 |
5. Building Linear Regression Model
# One level of each amenity is omitted as the baseline to avoid the dummy-variable trap
X = amenity2[['WiFi_Network_Strong', 'WiFi_Network_Best in Class',
'breakfast_Continental', 'breakfast_Full Buffet', 'parking_Valet', 'gym_Basic',
'gym_Advanced', 'gym_Super', 'flex_check_Yes', 'shuttle_bus_Yes',
'air_pure_Yes', 'jacuzzi_Yes', 'VIP_shop_Yes', 'pool_temp_80',
'pool_temp_84']]
y = amenity2['avg_rating']
regressor = LinearRegression()
regressor.fit(X, y)
LinearRegression()
6. Checking the Linear Regression Coefficient on the Amenities
coef_df = pd.DataFrame(regressor.coef_, X.columns, columns = ['Coefficient'])
coef_df
| Coefficient | |
|---|---|
| WiFi_Network_Strong | 1.187700 |
| WiFi_Network_Best in Class | 1.726814 |
| breakfast_Continental | 0.225924 |
| breakfast_Full Buffet | 0.726923 |
| parking_Valet | 0.093678 |
| gym_Basic | 0.022170 |
| gym_Advanced | 0.168484 |
| gym_Super | 0.212847 |
| flex_check_Yes | 0.478220 |
| shuttle_bus_Yes | 0.419939 |
| air_pure_Yes | 0.075258 |
| jacuzzi_Yes | 0.183909 |
| VIP_shop_Yes | 0.217925 |
| pool_temp_80 | 0.074744 |
| pool_temp_84 | 0.263806 |
7. Checking NA Values in Cost Dataset
costs.isna().any()
Amenity                                               False
Level                                                 False
Estimated Incremental Cost,\nPer Visitor/Per Night    False
dtype: bool
8. Preparing Cost Data
costs.loc[costs['Amenity'] == 'pool temp', 'Amenity'] = 'pool_temp'
# Swap rows 6 and 7 so the parking levels line up with the coefficient ordering
row1 = costs.loc[6].copy()
row2 = costs.loc[7].copy()
costs.loc[6] = row2
costs.loc[7] = row1
9. Splitting Linear Regression Coefficient Dataframe
# Split each coefficient index (e.g. "gym_Super") on '_' into its parts
split = coef_df.index.str.split('_', expand=True)
split_df = split.to_frame()
split_df.index = range(len(split_df))
10. Creating Level Combination Dataframe for Amenities
# Rejoin amenity names that contain '_' themselves (e.g. "WiFi_Network_Strong"
# splits into three parts: amenity = part0_part1, level = part2)
split_df[['Amenity', 'Level']] = split_df.apply(
lambda row: (
(row[0] + '_' + row[1], row[2]) if not pd.isna(row[2]) else (row[0], row[1])
),
axis=1,
result_type='expand'
)
split_df = split_df.drop(columns=[0, 1, 2])
split_df['Coefficient'] = coef_df.Coefficient.values
11. Merging Cost Data and Level Amenities Combination Data
merged_df = costs.merge(split_df, on=['Amenity', 'Level'], how='outer')
merged_df = merged_df.fillna(0)
merged_df
| Amenity | Level | Estimated Incremental Cost,\nPer Visitor/Per Night | Coefficient | |
|---|---|---|---|---|
| 0 | WiFi_Network | Basic | 11.75 | 0.000000 |
| 1 | WiFi_Network | Strong | 16.25 | 1.187700 |
| 2 | WiFi_Network | Best in Class | 19.15 | 1.726814 |
| 3 | breakfast | None | 0.00 | 0.000000 |
| 4 | breakfast | Continental | 13.25 | 0.225924 |
| 5 | breakfast | Full Buffet | 22.45 | 0.726923 |
| 6 | parking | Open Lot | 15.00 | 0.000000 |
| 7 | parking | Valet | 60.00 | 0.093678 |
| 8 | gym | None | 0.00 | 0.000000 |
| 9 | gym | Basic | 10.00 | 0.022170 |
| 10 | gym | Advanced | 35.00 | 0.168484 |
| 11 | gym | Super | 65.00 | 0.212847 |
| 12 | flex_check | No | 0.00 | 0.000000 |
| 13 | flex_check | Yes | 12.00 | 0.478220 |
| 14 | shuttle_bus | No | 0.00 | 0.000000 |
| 15 | shuttle_bus | Yes | 75.00 | 0.419939 |
| 16 | air_pure | No | 0.00 | 0.000000 |
| 17 | air_pure | Yes | 12.85 | 0.075258 |
| 18 | jacuzzi | No | 0.00 | 0.000000 |
| 19 | jacuzzi | Yes | 40.00 | 0.183909 |
| 20 | VIP_shop | No | 0.00 | 0.000000 |
| 21 | VIP_shop | Yes | 12.00 | 0.217925 |
| 22 | pool_temp | 76 | 15.00 | 0.000000 |
| 23 | pool_temp | 80 | 35.00 | 0.074744 |
| 24 | pool_temp | 84 | 45.00 | 0.263806 |
12. Creating All Possible Combination of Amenities along with Cost
# Group the cost/coefficient rows by amenity
grouped = merged_df.groupby("Amenity")
# Row indices available for each amenity (one level must be chosen per amenity)
combinations = [list(group.index) for _, group in grouped]
# Cartesian product: every possible way to pick one level of each amenity
all_combinations = list(itertools.product(*combinations))
best_coefficient = 0
best_combination = None
for combination in all_combinations:
selected_amenities = merged_df.iloc[list(combination)]
total_cost = selected_amenities["Estimated Incremental Cost,\nPer Visitor/Per Night"].sum()
total_coefficient = selected_amenities["Coefficient"].sum()
# Check if the current combination is better and meets the cost constraint
if total_cost <= 150 and total_coefficient > best_coefficient:
best_coefficient = total_coefficient
best_combination = selected_amenities
# Print the best combination
print("Best combination:")
print(best_combination[["Amenity", "Level", "Estimated Incremental Cost,\nPer Visitor/Per Night", "Coefficient"]])
# Print the total cost and total coefficient
print("\nTotal cost:", best_combination["Estimated Incremental Cost,\nPer Visitor/Per Night"].sum())
print("Total coefficient:", best_combination["Coefficient"].sum())
Best combination:
Amenity Level \
21 VIP_shop Yes
2 WiFi_Network Best in Class
17 air_pure Yes
5 breakfast Full Buffet
13 flex_check Yes
9 gym Basic
18 jacuzzi No
6 parking Open Lot
24 pool_temp 84
14 shuttle_bus No
Estimated Incremental Cost,\nPer Visitor/Per Night Coefficient
21 12.00 0.217925
2 19.15 1.726814
17 12.85 0.075258
5 22.45 0.726923
13 12.00 0.478220
9 10.00 0.022170
18 0.00 0.000000
6 15.00 0.000000
24 45.00 0.263806
14 0.00 0.000000
Total cost: 148.45
Total coefficient: 3.5111168981481504
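Since every predictor is a dummy variable and each baseline level carries a zero coefficient, the model's predicted average rating for the chosen bundle is simply the intercept plus the summed coefficients; a small sketch, assuming the regressor and best_combination objects from above:
# Predicted avg_rating for the winning bundle: intercept + sum of its level coefficients
predicted_rating = regressor.intercept_ + best_combination["Coefficient"].sum()
print("Predicted average rating:", round(predicted_rating, 2))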
After analyzing the Golden Arch Hotel case study, I have gathered insights on the strengths, weaknesses, opportunities, and threats (SWOT) that McDonald's faced when entering the hotel market. This analysis will help inform recommendations for Lobster Land's potential overseas expansion. SWOT Analysis for Golden Arch Hotel:
Strengths:
· Strong, globally recognized McDonald's brand and existing customer base.
· Technologically advanced amenities (in-room TVs with wireless keyboards, internet access) targeted at business travelers.
Weaknesses:
· Mismatch between the fast-food brand and a four-star hotel positioning, creating confused customer expectations.
· Limited variety and ambiance in the food and beverage offerings.
· Inconsistent customer experience across rooms, service, and facilities.
Opportunities:
· Capitalize on the established McDonald's brand to attract customers.
· Collaborate with airlines, tour operators, and business travelers to increase the occupancy rate.
· Leverage the hotel's proximity to the airport and exhibition area.
Threats:
· Strong competition from other established hotel chains.
· Negative customer experiences impacting the hotel's reputation.
· Potential language and cultural barriers.
The Golden Arch Hotel capitalized on the McDonald's brand, leveraging its strong brand recognition to create awareness and generate interest in the hotel. This strategy allowed it to tap into McDonald's existing customer base and brand loyalty, which played a significant role in attracting guests. The hotel also offered technologically advanced amenities, such as in-room TVs with wireless keyboards and internet access (still uncommon at the time), catering to its target market of business travelers. These amenities positioned the hotel as a convenient and attractive option for its intended audience.
However, there were areas for improvement. Aligning brand expectations was a crucial aspect that McDonald's should have addressed more effectively. The association between their fast-food brand and a four-star hotel created confusion among customers who expected a more upscale experience. McDonald's could have developed a sub-brand or utilized distinct branding elements to differentiate the Golden Arch Hotel from their fast-food restaurants, allowing them to set more appropriate customer expectations.
Another area for improvement is the enhancement of food and beverage offerings. The hotel's restaurant and bar lacked variety and ambiance, which detracted from the overall guest experience. Offering a wider range of food options, hiring skilled chefs, and investing in creating a more inviting ambiance could have significantly improved customer satisfaction. Incorporating elements of local cuisine and culture into the menu would have also helped the hotel to stand out from competitors and provide guests with a unique experience.
Lastly, ensuring a consistent customer experience is vital in the hotel industry. McDonald's should have focused on delivering a high-quality experience across all aspects of the hotel, such as room cleanliness, staff service, and facilities maintenance. By doing so, they could have fostered customer satisfaction and generated positive word of mouth, which is invaluable in the hospitality sector. To achieve this, they could have invested in comprehensive staff training and regular quality assurance audits to ensure that every aspect of the hotel met or exceeded the expectations of their guests.
By addressing these areas for improvement, the Golden Arch Hotel could have built on the strong foundation established through their effective use of the McDonald's brand and positioning in the market. Ultimately, this would have led to increased customer satisfaction and long-term success in the competitive hotel industry.
Recommendations for Lobster Land
Drawing from the Golden Arch Hotel case study, Lobster Land should focus on establishing a clear brand identity that differentiates it from competitors by emphasizing its unique seaside atmosphere, affordability, and family-friendly experience. Aligning brand expectations is crucial for Lobster Land to avoid confusion that might arise from brand associations. Localization and catering to local preferences is essential as Lobster Land continues to grow, which can be achieved by incorporating local ingredients and traditions, collaborating with local businesses, or creating region-specific attractions.
Ensuring a consistent, high-quality customer experience across all aspects of Lobster Land is vital for customer satisfaction and positive word of mouth. This includes ride operations, staff service, cleanliness, and other facilities. Lobster Land can also benefit from leveraging advanced technology solutions, such as park-issued wristbands for payments and ride access. By collecting and analyzing data from these wristbands, Lobster Land can optimize operations, enhance visitor experiences, and make data-driven decisions for future improvements.
Inspired by the Golden Arch Hotel case, Lobster Land can create seasonal events and promotions to attract visitors during peak times, with themed activities, entertainment, or limited-time food offerings that celebrate local culture and traditions. Demonstrating a commitment to sustainability and community engagement can enhance Lobster Land's reputation and appeal. Initiatives such as waste reduction, energy conservation, and support for local charities or events can help position the park as a responsible and caring member of the community. By implementing these recommendations, Lobster Land can strengthen its brand identity, enhance visitor experiences, and foster long-term success as a popular seaside amusement park.
df = pd.read_csv("hotel_satisfaction.csv")
df.head()
| id | Gender | Age | purpose_of_travel | Type of Travel | Type Of Booking | Hotel wifi service | Departure/Arrival convenience | Ease of Online booking | Hotel location | Food and drink | Stay comfort | Common Room entertainment | Checkin/Checkout service | Other service | Cleanliness | satisfaction | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 70172 | Male | 13 | aviation | Personal Travel | Not defined | 3 | 4 | 3 | 1 | 5 | 5 | 5 | 4 | 5 | 5 | neutral or dissatisfied |
| 1 | 5047 | Male | 25 | tourism | Group Travel | Group bookings | 3 | 2 | 3 | 3 | 1 | 1 | 1 | 1 | 4 | 1 | neutral or dissatisfied |
| 2 | 110028 | Female | 26 | tourism | Group Travel | Group bookings | 2 | 2 | 2 | 2 | 5 | 5 | 5 | 4 | 4 | 5 | satisfied |
| 3 | 24026 | Female | 25 | tourism | Group Travel | Group bookings | 2 | 5 | 5 | 5 | 2 | 2 | 2 | 1 | 4 | 2 | neutral or dissatisfied |
| 4 | 119299 | Male | 61 | aviation | Group Travel | Group bookings | 3 | 3 | 3 | 3 | 4 | 5 | 3 | 3 | 3 | 3 | satisfied |
df.shape
(103904, 17)
list(df.columns)
['id', 'Gender', 'Age', 'purpose_of_travel', 'Type of Travel', 'Type Of Booking', 'Hotel wifi service', 'Departure/Arrival convenience', 'Ease of Online booking', 'Hotel location', 'Food and drink', 'Stay comfort', 'Common Room entertainment', 'Checkin/Checkout service', 'Other service', 'Cleanliness', 'satisfaction']
Categorical and Numerical Variables
Categorical variables: id, Gender, purpose_of_travel, Type of Travel, Type Of Booking, satisfaction
Numerical variables: Age, Hotel wifi service, Departure/Arrival convenience, Ease of Online booking, Hotel location, Food and drink, Stay comfort, Common Room entertainment, Checkin/Checkout service, Other service, Cleanliness
Values of target variable 'satisfaction'
df['satisfaction'].value_counts()
neutral or dissatisfied    58879
satisfied                  45025
Name: satisfaction, dtype: int64
We do not have a significant class imbalance.
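The normalized counts make the split explicit (roughly 57% vs. 43%); a quick check using the same df:
# Class shares instead of raw counts
print(df['satisfaction'].value_counts(normalize=True).round(3))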
Are there NAs present in this dataset?
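A minimal check answering this, using the df loaded above:
# Total number of missing values across all columns (0 means no NAs)
print(df.isna().sum().sum())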
nums = df[['Age', 'Hotel wifi service',
           'Departure/Arrival convenience',
           'Ease of Online booking',
           'Hotel location',
           'Food and drink',
           'Stay comfort',
           'Common Room entertainment',
           'Checkin/Checkout service',
           'Other service',
           'Cleanliness']]
cor_table = nums.corr()
plt.figure(figsize=(20, 12 ))
heatmap = sns.heatmap(cor_table, vmin=-1, vmax=1, annot=True)
heatmap.set_title('Correlation Heatmap', fontdict={'fontsize':20}, pad=20);
We examined the correlations among the numerical variables to check for multicollinearity, using a threshold of ±0.7.
The correlation between "Ease of Online booking" and "Hotel wifi service" is problematic: a strong positive correlation of 0.72, so we decided to remove 'Ease of Online booking'.
This makes sense, as the hotel wifi service contributes to the ease of online booking; the two variables describe the same thing.
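A small sketch to surface every pair beyond the ±0.7 threshold programmatically (using cor_table and the numpy import from the top of the notebook):
# Keep only the strict upper triangle so each pair appears once
mask = np.triu(np.ones(cor_table.shape, dtype=bool), k=1)
high_corr = cor_table.where(mask).stack()
print(high_corr[high_corr.abs() > 0.7])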
Removing ID and 'Ease of Online booking'
df2 = df.drop(['id', 'Ease of Online booking'], axis=1)
Convert Categorical Variables into numerical columns
cat_cols = ['Gender', 'purpose_of_travel', 'Type of Travel', 'Type Of Booking', 'satisfaction']
df_dum = pd.get_dummies(df2, drop_first=True,
columns= cat_cols)
Data Partition
from sklearn.model_selection import train_test_split
X = df_dum.loc[:, df_dum.columns != 'satisfaction_satisfied']
y = df_dum['satisfaction_satisfied']
# random_state makes the 60/40 split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=3)
print(X_train.shape, X_test.shape, type(X_train), type(X_test))
print(y_train.shape, y_test.shape, type(y_train), type(y_test))
(62342, 18) (41562, 18) <class 'pandas.core.frame.DataFrame'> <class 'pandas.core.frame.DataFrame'>
(62342,) (41562,) <class 'pandas.core.series.Series'> <class 'pandas.core.series.Series'>
Identifying Mean Values for First Hints on What Might Impact Satisfaction
# Group means of every feature by the outcome class
satisfaction_means = df_dum.groupby('satisfaction_satisfied').mean()
satisfaction_means
| Age | Hotel wifi service | Departure/Arrival convenience | Hotel location | Food and drink | Stay comfort | Common Room entertainment | Checkin/Checkout service | Other service | Cleanliness | Gender_Male | purpose_of_travel_aviation | purpose_of_travel_business | purpose_of_travel_personal | purpose_of_travel_tourism | Type of Travel_Personal Travel | Type Of Booking_Individual/Couple | Type Of Booking_Not defined | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| satisfaction_satisfied | ||||||||||||||||||
| 0 | 37.566688 | 2.399633 | 3.129112 | 2.976121 | 2.95805 | 3.036295 | 2.894156 | 3.042952 | 3.388814 | 2.936123 | 0.487203 | 0.134819 | 0.203519 | 0.091425 | 0.307461 | 0.492026 | 0.646139 | 0.095960 |
| 1 | 41.750583 | 3.161288 | 2.970305 | 2.977879 | 3.52131 | 3.966530 | 3.964931 | 3.646041 | 3.969461 | 3.744342 | 0.499522 | 0.131216 | 0.205552 | 0.092504 | 0.309828 | 0.072826 | 0.193248 | 0.040955 |
Based on this simple comparison of means between the outcome classes, the variables that look impactful are: Age, Hotel wifi service, Food and drink, Stay comfort, Common Room entertainment, Checkin/Checkout service, Other service, Cleanliness, Type of Travel, and Type of Booking.
Building Our Predictive Model: Logistic Regression
First Iteration
logit_model = sm.Logit(y_train, sm.add_constant(X_train))
result = logit_model.fit()
print(result.summary())
Optimization terminated successfully.
Current function value: inf
Iterations 7
Logit Regression Results
==================================================================================
Dep. Variable: satisfaction_satisfied No. Observations: 62342
Model: Logit Df Residuals: 62323
Method: MLE Df Model: 18
Date: Wed, 10 Jan 2024 Pseudo R-squ.: inf
Time: 00:11:00 Log-Likelihood: -inf
converged: True LL-Null: 0.0000
Covariance Type: nonrobust LLR p-value: 1.000
=====================================================================================================
coef std err z P>|z| [0.025 0.975]
-----------------------------------------------------------------------------------------------------
const -4.6981 0.079 -59.709 0.000 -4.852 -4.544
Age 0.0137 0.001 17.028 0.000 0.012 0.015
Hotel wifi service 0.5716 0.010 56.119 0.000 0.552 0.592
Departure/Arrival convenience -0.0556 0.009 -5.879 0.000 -0.074 -0.037
Hotel location -0.1748 0.010 -16.872 0.000 -0.195 -0.155
Food and drink -0.1362 0.012 -11.068 0.000 -0.160 -0.112
Stay comfort 0.2504 0.013 19.818 0.000 0.226 0.275
Common Room entertainment 0.3701 0.015 24.575 0.000 0.341 0.400
Checkin/Checkout service 0.3398 0.010 35.176 0.000 0.321 0.359
Other service 0.1936 0.012 16.281 0.000 0.170 0.217
Cleanliness 0.1716 0.014 12.405 0.000 0.145 0.199
Gender_Male 0.0394 0.023 1.740 0.082 -0.005 0.084
purpose_of_travel_aviation 0.0043 0.038 0.112 0.911 -0.070 0.079
purpose_of_travel_business -0.0307 0.033 -0.921 0.357 -0.096 0.035
purpose_of_travel_personal -0.0058 0.043 -0.135 0.893 -0.091 0.079
purpose_of_travel_tourism -0.0389 0.030 -1.294 0.196 -0.098 0.020
Type of Travel_Personal Travel -1.8809 0.035 -53.810 0.000 -1.949 -1.812
Type Of Booking_Individual/Couple -1.4889 0.027 -54.378 0.000 -1.543 -1.435
Type Of Booking_Not defined -1.3149 0.048 -27.354 0.000 -1.409 -1.221
=====================================================================================================
Based on this regression output we can identify variables with little predictive value. Judging by the p-values, 'purpose_of_travel' is insignificant across all of its levels, and Gender is also not significant (p = 0.082). We will therefore remove these variables and run the regression again.
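This pruning is done manually below; the same idea could be automated with a simple backward-elimination loop (a sketch; note that it drops one dummy column at a time, whereas we remove the whole 'purpose_of_travel' categorical at once):
# Backward elimination sketch: refit, drop the single least-significant
# feature, and repeat until every remaining p-value is below the threshold.
def backward_eliminate(X, y, threshold=0.05):
    cols = list(X.columns)
    while True:
        fitted = sm.Logit(y, sm.add_constant(X[cols])).fit(disp=0)
        pvals = fitted.pvalues.drop('const')  # ignore the intercept
        worst = pvals.idxmax()
        if pvals[worst] <= threshold:
            return cols, fitted
        cols.remove(worst)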
Iteration #2
X_train2 = X_train.drop(['purpose_of_travel_aviation', 'purpose_of_travel_business', 'purpose_of_travel_personal', 'purpose_of_travel_tourism', 'Gender_Male'], axis=1)
X_train2.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 62342 entries, 4830 to 71530
Data columns (total 13 columns):
 #   Column                             Non-Null Count  Dtype
---  ------                             --------------  -----
 0   Age                                62342 non-null  int64
 1   Hotel wifi service                 62342 non-null  int64
 2   Departure/Arrival convenience      62342 non-null  int64
 3   Hotel location                     62342 non-null  int64
 4   Food and drink                     62342 non-null  int64
 5   Stay comfort                       62342 non-null  int64
 6   Common Room entertainment          62342 non-null  int64
 7   Checkin/Checkout service           62342 non-null  int64
 8   Other service                      62342 non-null  int64
 9   Cleanliness                        62342 non-null  int64
 10  Type of Travel_Personal Travel     62342 non-null  uint8
 11  Type Of Booking_Individual/Couple  62342 non-null  uint8
 12  Type Of Booking_Not defined        62342 non-null  uint8
dtypes: int64(10), uint8(3)
memory usage: 5.4 MB
logit_model2 = sm.Logit(y_train, sm.add_constant(X_train2))
result2 = logit_model2.fit()
print(result2.summary())
Optimization terminated successfully.
Current function value: inf
Iterations 7
Logit Regression Results
==================================================================================
Dep. Variable: satisfaction_satisfied No. Observations: 62342
Model: Logit Df Residuals: 62328
Method: MLE Df Model: 13
Date: Wed, 10 Jan 2024 Pseudo R-squ.: inf
Time: 00:11:00 Log-Likelihood: -inf
converged: True LL-Null: 0.0000
Covariance Type: nonrobust LLR p-value: 1.000
=====================================================================================================
coef std err z P>|z| [0.025 0.975]
-----------------------------------------------------------------------------------------------------
const -4.6992 0.076 -62.112 0.000 -4.848 -4.551
Age 0.0137 0.001 17.035 0.000 0.012 0.015
Hotel wifi service 0.5717 0.010 56.136 0.000 0.552 0.592
Departure/Arrival convenience -0.0556 0.009 -5.876 0.000 -0.074 -0.037
Hotel location -0.1748 0.010 -16.869 0.000 -0.195 -0.154
Food and drink -0.1355 0.012 -11.027 0.000 -0.160 -0.111
Stay comfort 0.2501 0.013 19.793 0.000 0.225 0.275
Common Room entertainment 0.3693 0.015 24.541 0.000 0.340 0.399
Checkin/Checkout service 0.3398 0.010 35.178 0.000 0.321 0.359
Other service 0.1937 0.012 16.291 0.000 0.170 0.217
Cleanliness 0.1725 0.014 12.481 0.000 0.145 0.200
Type of Travel_Personal Travel -1.8802 0.035 -53.791 0.000 -1.949 -1.812
Type Of Booking_Individual/Couple -1.4892 0.027 -54.392 0.000 -1.543 -1.436
Type Of Booking_Not defined -1.3169 0.048 -27.403 0.000 -1.411 -1.223
=====================================================================================================
Checked Model (Iteration #3): removing Type of Travel and Type Of Booking to see their impact
X_train3 = X_train.drop(['purpose_of_travel_aviation', 'purpose_of_travel_business', 'purpose_of_travel_personal', 'purpose_of_travel_tourism', 'Gender_Male', 'Type of Travel_Personal Travel', 'Type Of Booking_Individual/Couple', 'Type Of Booking_Not defined'], axis=1)
logit_model3 = sm.Logit(y_train, sm.add_constant(X_train3))
result3 = logit_model3.fit()
print(result3.summary())
Optimization terminated successfully.
Current function value: inf
Iterations 6
Logit Regression Results
==================================================================================
Dep. Variable: satisfaction_satisfied No. Observations: 62342
Model: Logit Df Residuals: 62331
Method: MLE Df Model: 10
Date: Wed, 10 Jan 2024 Pseudo R-squ.: inf
Time: 00:11:00 Log-Likelihood: -inf
converged: True LL-Null: 0.0000
Covariance Type: nonrobust LLR p-value: 1.000
==================================================================================================
coef std err z P>|z| [0.025 0.975]
--------------------------------------------------------------------------------------------------
const -5.9751 0.066 -90.988 0.000 -6.104 -5.846
Age 0.0173 0.001 26.030 0.000 0.016 0.019
Hotel wifi service 0.6017 0.009 67.096 0.000 0.584 0.619
Departure/Arrival convenience -0.3234 0.008 -41.899 0.000 -0.339 -0.308
Hotel location -0.0515 0.009 -5.824 0.000 -0.069 -0.034
Food and drink -0.1813 0.011 -16.767 0.000 -0.202 -0.160
Stay comfort 0.3565 0.011 33.132 0.000 0.335 0.378
Common Room entertainment 0.4031 0.013 30.622 0.000 0.377 0.429
Checkin/Checkout service 0.3500 0.008 41.396 0.000 0.333 0.367
Other service 0.2476 0.010 23.895 0.000 0.227 0.268
Cleanliness 0.0921 0.012 7.542 0.000 0.068 0.116
==================================================================================================
# train a scikit-learn logistic regression on the reduced feature set from iteration #2
logmodel = LogisticRegression()
logmodel.fit(X_train2, y_train)
# drop the same columns from the test set so the fitted model can score it
X_test2 = X_test.drop(['purpose_of_travel_aviation', 'purpose_of_travel_business', 'purpose_of_travel_personal', 'purpose_of_travel_tourism', 'Gender_Male'], axis=1)
# predict on both partitions
predictions_train = logmodel.predict(X_train2)
predictions_test = logmodel.predict(X_test2)
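Beyond hard class predictions, the model's probability scores can be summarized with ROC AUC (a sketch; 'metrics' was imported from sklearn at the top of the notebook):
# Score the predicted probabilities with ROC AUC as a threshold-free metric
proba_test = logmodel.predict_proba(X_test2)[:, 1]
print("Test ROC AUC:", metrics.roc_auc_score(y_test, proba_test))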
Confusion Matrix for Test Set
# sklearn convention is confusion_matrix(y_true, y_pred): rows are actual
# outcomes, columns are predictions
cm = confusion_matrix(y_test, predictions_test)
sns.heatmap(cm, square=True, fmt='g', annot=True, cbar=False)
plt.xlabel("predicted result")
plt.ylabel("actual result")
a, b = plt.ylim()  # nudge the y-limits to work around a matplotlib/seaborn cropping bug
a += .5
b -= .5
plt.ylim(a, b)
plt.show()
# with y_true passed first, ravel() unpacks in the order tn, fp, fn, tp
tn, fp, fn, tp = cm.ravel()
accuracy = (tp + tn) / (tp + tn + fp + fn)
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
precision = tp / (tp + fp)
balanced_accuracy = (sensitivity + specificity) / 2
print("Accuracy:", accuracy)
print("Sensitivity:", sensitivity)
print("Specificity:", specificity)
print("Precision:", precision)
print("Balanced Accuracy:", balanced_accuracy)
Accuracy: 0.8405995861604351
Sensitivity: 0.8180702732849973
Specificity: 0.8576929587000677
Precision: 0.8134878819810326
Balanced Accuracy: 0.8378816159925324
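As a cross-check on the hand-computed metrics, scikit-learn's built-in report (classification_report was imported at the top of the notebook) produces per-class precision and recall directly:
# Built-in per-class metrics; recall for class 1 should match sensitivity
print(classification_report(y_test, predictions_test, digits=4))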
Confusion Matrix for Train Set
# same convention as above: actual outcomes first, predictions second
cm2 = confusion_matrix(y_train, predictions_train)
sns.heatmap(cm2, square=True, fmt='g', annot=True, cbar=False)
plt.xlabel("predicted result")
plt.ylabel("actual result")
a, b = plt.ylim()  # same cropping workaround as above
a += .5
b -= .5
plt.ylim(a, b)
plt.show()
# unpack counts and compute the same metric set on the training partition
tn, fp, fn, tp = cm2.ravel()
accuracy = (tp + tn) / (tp + tn + fp + fn)
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
precision = tp / (tp + fp)
balanced_accuracy = (sensitivity + specificity) / 2
print("Accuracy:", accuracy)
print("Sensitivity:", sensitivity)
print("Specificity:", specificity)
print("Precision:", precision)
print("Balanced Accuracy:", balanced_accuracy)
Accuracy: 0.8418401719547015
Sensitivity: 0.8171318575553417
Specificity: 0.860733695652174
Precision: 0.8177372749499889
Balanced Accuracy: 0.8389327766037578
There is only a slight drop in accuracy from the training set (0.842) to the test set (0.841), so the model generalizes very well, and the accuracy metrics are high overall.
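A k-fold cross-validation run would further support this generalization claim (a sketch; cross_val_score is a standard scikit-learn utility, and the choice of 5 folds is an assumption):
# 5-fold cross-validation on the reduced feature set as an extra
# generalization check
from sklearn.model_selection import cross_val_score
cv_scores = cross_val_score(LogisticRegression(max_iter=1000),
                            X[X_train2.columns], y, cv=5, scoring='accuracy')
print("CV accuracy per fold:", cv_scores.round(4))
print("Mean CV accuracy:", cv_scores.mean().round(4))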
Additional Train Set Stats for the Checked Model (Iteration #3):
# train a scikit-learn logistic regression on the iteration-#3 feature set
logmodel = LogisticRegression()
logmodel.fit(X_train3, y_train)
# drop the same columns from the test set for later evaluation
X_test3 = X_test.drop(['purpose_of_travel_aviation', 'purpose_of_travel_business', 'purpose_of_travel_personal', 'purpose_of_travel_tourism', 'Gender_Male', 'Type of Travel_Personal Travel', 'Type Of Booking_Individual/Couple', 'Type Of Booking_Not defined'], axis=1)
# predict on both partitions
predictions_train = logmodel.predict(X_train3)
predictions_test = logmodel.predict(X_test3)
# actual outcomes first, predictions second, as above
cm2 = confusion_matrix(y_train, predictions_train)
sns.heatmap(cm2, square=True, fmt='g', annot=True, cbar=False)
plt.xlabel("predicted result")
plt.ylabel("actual result")
a, b = plt.ylim()  # same cropping workaround as above
a += .5
b -= .5
plt.ylim(a, b)
plt.show()
# unpack counts and compute the same metric set
tn, fp, fn, tp = cm2.ravel()
accuracy = (tp + tn) / (tp + tn + fp + fn)
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
precision = tp / (tp + fp)
balanced_accuracy = (sensitivity + specificity) / 2
print("Accuracy:", accuracy)
print("Sensitivity:", sensitivity)
print("Specificity:", specificity)
print("Precision:", precision)
print("Balanced Accuracy:", balanced_accuracy)
Accuracy: 0.7644445157357801
Sensitivity: 0.7332575326890278
Specificity: 0.7873293100091776
Precision: 0.7167148255167816
Balanced Accuracy: 0.7602934213491027
This model performs clearly worse than iteration #2 (accuracy 0.764 vs. 0.842), confirming that Type of Travel and Type Of Booking carry substantial predictive value and should stay in the model.
Interpretation of the Logistic Regression Model and its meaning:
To give conference attendees a better understanding of what influences their guests' satisfaction, we can look at the coefficients of the logistic regression constructed above and draw conclusions about the impact of individual variables on satisfaction.
Based on the coefficients of the model (iteration #2), Hotel wifi service is one of the most impactful drivers of customer satisfaction. A marketing initiative geared towards this preference would be to install, and then advertise, first-rate internet connectivity. A badge on the booking page reading "500 Mb/s is our standard" might convince guests looking for a connected stay.
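To translate these log-odds coefficients into something easier to present, we can exponentiate them into odds ratios (a sketch using the fitted result2 from iteration #2):
# Odds ratios: exp(coefficient) gives the multiplicative change in the odds
# of satisfaction per one-unit increase in each feature
odds_ratios = np.exp(result2.params).sort_values(ascending=False)
print(odds_ratios)
For example, exp(0.5717) ≈ 1.77, so each one-point increase in the wifi rating multiplies a guest's odds of being satisfied by roughly 1.77, holding the other features constant.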
Three variables related to convenience inside the hotel that strongly and positively impact satisfaction are Common Room entertainment, Checkin/Checkout service, and Stay comfort. To leverage the first, a hotel could advertise certain paid-for channels (e.g. "our rooms have Netflix/HBO on demand"). Based on the second, a hotel might analyze its check-in and check-out processes and streamline them where possible.
The type of travel can have a significantly negative impact on the reported experience: the large negative coefficient on personal travel suggests these guests are much harder to satisfy than the baseline group. It would be worthwhile for hotel management to analyze the preferences associated with different booking types and tailor offerings to each group. Another variable hotels should watch when trying to please customers is Departure/Arrival convenience: for a city hotel, clear instructions on how to find the hotel and where to park might help with satisfaction, and obstacle-free movement of luggage is a second thing to look at.
st= pd.read_csv("promo_pics.csv")
st.head()
| | recipient | pic_seen | site_duration | spend | register |
|---|---|---|---|---|---|
| 0 | 1 | Sunset | 18.20 | 16.60 | 0 |
| 1 | 2 | Main St | 28.61 | 15.30 | 0 |
| 2 | 3 | Waterslide | 10.90 | 16.32 | 1 |
| 3 | 4 | Waterslide | 11.30 | 22.62 | 0 |
| 4 | 5 | Sunset | 19.70 | 17.30 | 0 |
st.describe()
| | recipient | site_duration | spend | register |
|---|---|---|---|---|
| count | 3400.000000 | 3400.000000 | 3400.000000 | 3400.000000 |
| mean | 1700.500000 | 19.381494 | 16.461041 | 0.376176 |
| std | 981.639785 | 6.559998 | 2.707941 | 0.484496 |
| min | 1.000000 | 9.300000 | 7.400000 | 0.000000 |
| 25% | 850.750000 | 11.300000 | 14.900000 | 0.000000 |
| 50% | 1700.500000 | 21.500000 | 16.600000 | 0.000000 |
| 75% | 2550.250000 | 24.400000 | 17.820000 | 1.000000 |
| max | 3400.000000 | 38.510000 | 27.120000 | 1.000000 |
To compare the three picture options we will use two-sample t-tests for the continuous outcomes (site duration and spend) and a binomial z-test for the registration rate, testing A to B, A to C, and B to C for significant differences.
A: Sunset
B: Main St.
C: Waterslide
Visual Inspection
sns.histplot(data=st, x='site_duration', hue='pic_seen', kde=False, bins=50)
plt.show()
sns.histplot(data=st, x='spend', hue='pic_seen', kde=False, bins=50)
plt.show()
There appear to be differences in the group means across the three picture options for both the spend and the site_duration variable.
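Since the t-tests below lean on approximate normality, a formal check with the Shapiro-Wilk test is a reasonable supplement (a sketch; shapiro was imported at the top of the notebook, and note the test is very sensitive at these sample sizes, so the histograms above remain the primary evidence):
# Shapiro-Wilk normality check per picture group and outcome variable
for pic in ['Sunset', 'Main St', 'Waterslide']:
    for col in ['spend', 'site_duration']:
        w, p = shapiro(st.loc[st['pic_seen'] == pic, col])
        print(f"{pic} / {col}: W={w:.3f}, p={p:.4f}")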
A/B Testing
from scipy import stats
# Welch's t-test: equal_var=False relaxes the equal-variance assumption
t, p = stats.ttest_ind(st.loc[st['pic_seen'] == "Sunset", 'spend'].values,
                       st.loc[st['pic_seen'] == "Main St", 'spend'].values,
                       equal_var=False)
print("T-stat: ", t)
print("P-Value: ", p)
T-stat: 41.02687256136666
P-Value: 1.7008147207546044e-248
With a p-value this small, the t-test indicates that we can say, with far more than 95% confidence, that there is a statistically significant difference between the spending of attendees who see the Sunset picture and those who see the Main St. picture.
sunset_spend = st[st['pic_seen'] == 'Sunset']['spend']
main_spend = st[st['pic_seen'] == 'Main St']['spend']
water_spend = st[st['pic_seen'] == 'Waterslide']['spend']
t_test_BC = stats.ttest_ind(main_spend, water_spend)
print("Main St to Waterslide", t_test_BC)
t_test_AB = stats.ttest_ind(sunset_spend, main_spend)
print("Sunset to Main St", t_test_AB)
t_test_AC = stats.ttest_ind(sunset_spend, water_spend)
print("Sunset to Waterslide", t_test_AC)
Main St to Waterslide Ttest_indResult(statistic=-48.202035799343925, pvalue=0.0)
Sunset to Main St Ttest_indResult(statistic=40.53370629078891, pvalue=2.4333914152341534e-270)
Sunset to Waterslide Ttest_indResult(statistic=-23.596074151014385, pvalue=3.5328205740249146e-110)
For the spend variable, all three pairwise comparisons are significant: mean spend differs depending on which picture is shown in the email, with the largest gap between Main St and Waterslide.
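For consistency, all three spend comparisons can also be run in one loop with Welch's correction applied throughout (a sketch; itertools was imported at the top of the notebook, and equal_var=False matches the first test above rather than the pooled-variance default used in the three tests just run):
# All pairwise Welch t-tests on spend in a single loop
spend_groups = {pic: st.loc[st['pic_seen'] == pic, 'spend']
                for pic in ['Sunset', 'Main St', 'Waterslide']}
for a, b in itertools.combinations(spend_groups, 2):
    t_stat, p_val = stats.ttest_ind(spend_groups[a], spend_groups[b], equal_var=False)
    print(f"{a} vs {b}: t={t_stat:.2f}, p={p_val:.3g}")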
sunset_site = st[st['pic_seen'] == 'Sunset']['site_duration']
main_site = st[st['pic_seen'] == 'Main St']['site_duration']
water_site = st[st['pic_seen'] == 'Waterslide']['site_duration']
t_test_BC = stats.ttest_ind(main_site, water_site)
print("Main St to Waterslide", t_test_BC)
t_test_AB = stats.ttest_ind(sunset_site, main_site)
print("Sunset to Main St", t_test_AB)
t_test_AC = stats.ttest_ind(sunset_site, water_site)
print("Sunset to Waterslide", t_test_AC)
Main St to Waterslide Ttest_indResult(statistic=112.12607762440105, pvalue=0.0)
Sunset to Main St Ttest_indResult(statistic=-9.911415184705456, pvalue=1.0724418388111107e-22)
Sunset to Waterslide Ttest_indResult(statistic=180.71977956193996, pvalue=0.0)
When recipients see the Waterslide picture in the email, their stay on the site is significantly shorter than when they see the Main St. or Sunset picture.
Binomial Z-Test for 'register' variable
st['pic_seen'].value_counts()
Main St       1148
Waterslide    1142
Sunset        1110
Name: pic_seen, dtype: int64
st['register'].value_counts(normalize=True)
0    0.623824
1    0.376176
Name: register, dtype: float64
st[['pic_seen', 'register']].value_counts()
pic_seen    register
Main St     0           756
Waterslide  0           744
Sunset      0           621
Sunset      1           489
Waterslide  1           398
Main St     1           392
dtype: int64
prop_compare = st.groupby('pic_seen')[['register']].mean()
prop_compare
| pic_seen | register |
|---|---|
| Main St | 0.341463 |
| Sunset | 0.440541 |
| Waterslide | 0.348511 |
pic_spend = st.groupby('pic_seen')[['spend']].mean()
pic_spend
| pic_seen | spend |
|---|---|
| Main St | 14.016289 |
| Sunset | 16.781892 |
| Waterslide | 18.606778 |
from statsmodels.stats.proportion import proportions_ztest
# each crosstab has two rows (False = the left-out picture, True = the other
# two pooled), so this first test compares Sunset against Waterslide + Main St
table = pd.crosstab(st['pic_seen'].isin(['Waterslide', 'Main St']), st['register'])
stat, pval = proportions_ztest(table.iloc[:, 1], table.sum(axis=1))
print("Test statistic:", stat)
print("p-value:", pval)
Test statistic: 5.393864291006912 p-value: 6.895830810111409e-08
# Main St vs Waterslide + Sunset pooled
table = pd.crosstab(st['pic_seen'].isin(['Waterslide', 'Sunset']), st['register'])
stat, pval = proportions_ztest(table.iloc[:, 1], table.sum(axis=1))
print("Test statistic:", stat)
print("p-value:", pval)
Test statistic: -2.9832686777741704 p-value: 0.0028518752251626817
# Waterslide vs Sunset + Main St pooled
table = pd.crosstab(st['pic_seen'].isin(['Sunset', 'Main St']), st['register'])
stat, pval = proportions_ztest(table.iloc[:, 1], table.sum(axis=1))
print("Test statistic:", stat)
print("p-value:", pval)
Test statistic: -2.368186457800072 p-value: 0.017875524587415727
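As noted in the comments above, each crosstab pools two pictures against the third. A strictly pairwise version (a sketch, shown here for Sunset vs. Main St) restricts the data to exactly two groups first:
# Pairwise proportions z-test: Sunset vs Main St registration rates only
pair = st[st['pic_seen'].isin(['Sunset', 'Main St'])]
successes = pair.groupby('pic_seen')['register'].sum()
totals = pair['pic_seen'].value_counts().reindex(successes.index)
z_stat, z_pval = proportions_ztest(successes, totals)
print("Sunset vs Main St:", z_stat, z_pval)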
For the statistical testing portion of this project, we conducted an A/B test on the 'promo_pics' dataset. After importing the dataset, we checked the distributions of the two continuous outcomes, spend (amount spent) and site_duration (time on the website), since approximate normality supports the reliability of the t-tests. We then ran t-tests comparing spend across the three pictures, using a p-value threshold of 0.05 to decide whether two groups have significantly different means. All three tests returned p-values far below 0.05: Main St to Waterslide (≈0), Sunset to Main St (2.43e-270), and Sunset to Waterslide (3.53e-110). We can therefore conclude the differences did not happen coincidentally and are statistically significant. If Lobsterland wants to increase spend, it should show the Waterslide picture in the email.
The second set of t-tests, based on site duration, showed p-values of effectively zero for Main St to Waterslide and Sunset to Waterslide, with Sunset to Main St at 1.07e-22. This tells us the mean differences in site duration are not coincidental and that the picture affects the recipient's behavior. To achieve the longest site duration, Lobsterland should show the Main St. picture.
The next step was determining which image yields the highest registration rate. The results showed 34.1% for Main St, 44.1% for Sunset, and 34.9% for Waterslide. To determine whether the relationship between the photo and registering is statistically significant, we conducted a binomial z-test comparing each photo against the other two pooled and found all p-values below 0.05. We can therefore reject the null hypothesis that registration rates do not differ by photo, and the comparison identifies Sunset as the most influential image. Lobsterland should use Sunset to increase the number of people who register for the conference.
In conclusion, the insights gathered from the conjoint analysis, segmentation, and targeting, as well as the various recommendations provided for Lobster Land, all serve to help create a more appealing and successful seaside amusement park. By focusing on providing the right amenities at the best value per marginal cost, offering scalable services, and ensuring a consistent, high-quality customer experience, Lobster Land can effectively cater to its target audience while maintaining a strong brand identity.
Furthermore, it is essential for Lobster Land to keep a customer-centric approach and leverage data to make informed decisions for future improvements. By creating seasonal events, promoting sustainability, and engaging with the local community, Lobster Land can enhance its reputation and foster long-term success. These findings and recommendations not only have the potential to improve Lobster Land's operations and visitor experiences but also provide valuable insights into the process of analyzing and strategizing for an amusement park business. By applying these insights, Lobster Land can continue to grow and thrive as a popular destination for both local and international guests.
To ensure success, Lobster Land should focus on emphasizing its unique selling proposition (USP) by highlighting the distinctive seaside atmosphere, affordability, and family-friendly experiences. It is crucial to provide the right amenities at the best value per marginal cost, offer scalable services during peak and off-seasons, and maintain a consistent, high-quality customer experience. Additionally, Lobster Land should engage with the local community, create seasonal events, and promote sustainability initiatives. By leveraging data-driven insights and keeping a customer-centric approach, Lobster Land can foster long-term success and appeal to a broad range of visitors.